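In recent years, speech enhancement has improved considerably with the use of end-to-end neural networks. However, most models are agnostic to the phonetic content of the speech. Recently, several studies have proposed phonetic-aware speech enhancement, mainly by means of perceptual supervision. Yet injecting phonetic features during model optimization can take other forms (e.g., model conditioning). In this paper, we conduct a systematic comparison between different methods of incorporating phonetic information into a speech enhancement model. Through a series of controlled experiments, we observe the effect of different phonetic content models and of various feature-injection techniques on enhancement performance, considering both causal and non-causal models. Specifically, we evaluate three settings for injecting phonetic information, namely: i) feature conditioning; ii) perceptual supervision; and iii) regularization. Phonetic features are obtained from an intermediate layer of either a supervised pre-trained Automatic Speech Recognition (ASR) model or a pre-trained Self-Supervised Learning (SSL) model. We further observe the effect of choosing different embedding layers on performance, considering both manual and learned configurations. Results suggest that using an SSL model as the source of phonetic features outperforms the ASR model in most cases. Interestingly, the conditioning setting performs best among the evaluated configurations.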
translated by 谷歌翻译
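Learning a new language involves constantly comparing one's own speech productions with reference productions from the environment. Early in speech acquisition, children make articulatory adjustments to match their caregivers' speech. Adult learners of a language adjust their speech to match a tutor's reference. This paper proposes a method to synthetically generate correct pronunciation feedback. Furthermore, our aim is to generate the corrected production while maintaining the speaker's original voice. The system prompts the user to pronounce a phrase. The speech is recorded, and the samples associated with the inaccurate phoneme are masked with zeros. This waveform serves as the input to a speech generator, implemented as a deep learning inpainting system with a U-Net architecture and trained to output reconstructed speech. The training set is composed of unimpaired, proper speech examples, and the generator is trained to reconstruct the original proper speech. We evaluated the performance of our system on phoneme replacement in minimal-pair words of English as well as on children with pronunciation disorders. Results suggest that human listeners slightly prefer our generated speech over a smoothed replacement of the inaccurate phoneme with a production from a different speaker.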
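The masking step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `mask_segments`, the representation of the waveform as a list of floats, and the use of (start, end) times in seconds to mark the inaccurate phoneme are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def mask_segments(
    waveform: Sequence[float],
    segments: Sequence[Tuple[float, float]],
    sample_rate: int,
) -> List[float]:
    """Return a copy of `waveform` with each (start_s, end_s) span set to 0.0.

    Hypothetical helper: the segment boundaries would come from whatever
    component flags the mispronounced phoneme, which is not modelled here.
    """
    masked = list(waveform)
    n = len(masked)
    for start_s, end_s in segments:
        # Convert seconds to sample indices and clamp to the signal bounds.
        start = max(0, int(round(start_s * sample_rate)))
        end = min(n, int(round(end_s * sample_rate)))
        for i in range(start, end):
            masked[i] = 0.0
    return masked


# Toy signal: 10 samples at 10 Hz, with the span 0.3 s to 0.6 s flagged.
signal = [0.1 * (i + 1) for i in range(10)]
masked_signal = mask_segments(signal, [(0.3, 0.6)], sample_rate=10)
```

The masked waveform would then be passed to the inpainting generator, which is trained to fill the zeroed span; the original signal is left untouched so it can serve as the reconstruction target during training.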
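Speech emotion conversion is the task of modifying the perceived emotion of a speech utterance while preserving the lexical content and the speaker identity. In this study, we cast the problem of emotion conversion as a spoken language translation task. We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion. First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units. Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder. Such a paradigm allows us to go beyond spectral and parametric changes of the signal and to model non-verbal vocalizations, such as laughter insertion, yawning removal, etc. We demonstrate objectively and subjectively that the proposed method is superior to the baselines in terms of perceived emotion and audio quality. We rigorously evaluate all components of such a complex system and conclude with an extensive model analysis and ablation study to better emphasize the architectural choices, strengths, and weaknesses of the proposed method. Samples and code will be publicly available under the following link: https://speechbot.github.io/emotion.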
We address the challenge of building domain-specific knowledge models for industrial use cases, where labelled data and taxonomic information are initially scarce. Our focus is on inductive link prediction models as a basis for practical tools that support knowledge engineers in exploring text collections and in discovering and linking new (so-called open-world) entities to the knowledge graph. We argue that, although neural approaches to text mining have yielded impressive results in the past years, current benchmarks do not properly reflect the typical challenges encountered in the industrial wild. Therefore, our first contribution is an open benchmark coined IRT2 (inductive reasoning with text) that (1) covers knowledge graphs of varying sizes (including very small ones), (2) comes with incidental, low-quality text mentions, and (3) includes not only triple completion but also ranking, which is relevant for supporting experts with discovery tasks. We investigate two neural models for inductive link prediction, one based on end-to-end learning and one that learns from the knowledge graph and text data in separate steps. These models compete with a strong bag-of-words baseline. For linking, the results show a significant advance in performance for the neural approaches as soon as the available graph data decreases. For ranking, the results are promising, and the neural approaches outperform the sparse retriever by a wide margin.
Machine learning models are typically evaluated by computing their similarity with reference annotations and trained by maximizing that similarity. Especially in the bio-medical domain, annotations are subjective and suffer from low inter- and intra-rater reliability. Since annotations only reflect the annotation entity's interpretation of the real world, this can lead to sub-optimal predictions even though the model achieves high similarity scores. Here, the theoretical concept of Peak Ground Truth (PGT) is introduced. PGT marks the point beyond which an increase in similarity with the reference annotation stops translating to better Real World Model Performance (RWMP). Additionally, a quantitative technique to approximate PGT by computing inter- and intra-rater reliability is proposed. Finally, three categories of PGT-aware strategies to evaluate and improve model performance are reviewed.
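To make the idea of approximating PGT from rater agreement concrete, the sketch below computes a mean pairwise inter-rater score over several annotations of the same case. It is one plausible reading only: the abstract does not state which similarity metric or aggregation is used, so the Dice coefficient, the flattened binary masks, and the names `dice` and `mean_pairwise_dice` are assumptions for illustration.

```python
from itertools import combinations
from typing import Sequence


def dice(a: Sequence[int], b: Sequence[int]) -> float:
    """Dice coefficient between two equally long binary masks."""
    intersection = sum(1 for x, y in zip(a, b) if x and y)
    size = sum(1 for x in a if x) + sum(1 for y in b if y)
    # Two empty masks agree perfectly by convention.
    return 1.0 if size == 0 else 2.0 * intersection / size


def mean_pairwise_dice(annotations: Sequence[Sequence[int]]) -> float:
    """Average Dice over all rater pairs (assumed proxy for inter-rater reliability)."""
    pairs = list(combinations(annotations, 2))
    if not pairs:
        raise ValueError("need at least two annotations")
    return sum(dice(a, b) for a, b in pairs) / len(pairs)


# Three hypothetical raters segmenting the same six-pixel strip.
raters = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
]
inter_rater = mean_pairwise_dice(raters)
```

Under this reading, a model whose similarity with a single reference annotation exceeds the agreement that raters reach among themselves may already be past the point where further gains reflect real-world improvement.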
Efficient surrogate modelling is a key requirement for uncertainty quantification in data-driven scenarios. In this work, a novel approach that uses Sparse Random Features for surrogate modelling in combination with self-supervised dimensionality reduction is described. The method is compared to other methods on synthetic data and on real data obtained from crashworthiness analyses. The results show that the described approach is superior to state-of-the-art surrogate modelling techniques, namely Polynomial Chaos Expansions and Neural Networks.
In recent years, distributional reinforcement learning has produced many state-of-the-art results. Increasingly sample-efficient distributional algorithms for the discrete action domain have been developed over time; they vary primarily in the way they parameterize their approximations of value distributions and in how they quantify the differences between those distributions. In this work, we transfer three of the most well-known and successful of those algorithms (QR-DQN, IQN and FQF) to the continuous action domain by extending two powerful actor-critic algorithms (TD3 and SAC) with distributional critics. We investigate whether the relative performance of the methods for the discrete action space translates to the continuous case. To that end, we compare them empirically on the pybullet implementations of a set of continuous control tasks. Our results indicate qualitative invariance regarding the number and placement of distributional atoms in the deterministic, continuous action setting.
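As a point of reference for how such algorithms quantify the difference between value distributions, the sketch below implements the quantile Huber loss from the published QR-DQN formulation with fixed midpoint quantile fractions. It is a plain-Python illustration and not the code of this work: the function names, the list-based atoms, and the choice of `kappa = 1.0` are assumptions, and IQN and FQF differ in how the fractions are chosen.

```python
from typing import Sequence


def huber(u: float, kappa: float = 1.0) -> float:
    """Huber loss: quadratic within |u| <= kappa, linear beyond it."""
    a = abs(u)
    return 0.5 * u * u if a <= kappa else kappa * (a - 0.5 * kappa)


def quantile_huber_loss(
    predicted: Sequence[float],
    target: Sequence[float],
    kappa: float = 1.0,
) -> float:
    """Quantile Huber loss between predicted atoms and target atoms.

    Atom i of `predicted` is tied to the midpoint fraction (2i + 1) / (2N),
    as in QR-DQN; each error is weighted asymmetrically by that fraction.
    """
    n = len(predicted)
    taus = [(2 * i + 1) / (2 * n) for i in range(n)]
    total = 0.0
    for tau, theta in zip(taus, predicted):
        inner = 0.0
        for t in target:
            u = t - theta  # error of this atom against one target sample
            weight = abs(tau - (1.0 if u < 0 else 0.0))
            inner += weight * huber(u, kappa) / kappa
        total += inner / len(target)  # mean over targets, sum over atoms
    return total


loss = quantile_huber_loss([0.0, 1.0], [0.0, 1.0])
```

In an actor-critic setting, a loss of this kind would replace the scalar critic loss, with the target atoms produced by the bootstrapped target critic.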
Artificial Intelligence (AI) has become commonplace to solve routine everyday tasks. Because of the exponential growth in medical imaging data volume and complexity, the workload on radiologists is steadily increasing. We project that the gap between the number of imaging exams and the number of expert radiologist readers required to cover this increase will continue to expand, consequently introducing a demand for AI-based tools that improve the efficiency with which radiologists can comfortably interpret these exams. AI has been shown to improve efficiency in medical-image generation, processing, and interpretation, and a variety of such AI models have been developed across research labs worldwide. However, very few of these, if any, find their way into routine clinical use, a discrepancy that reflects the divide between AI research and successful AI translation. To address the barrier to clinical deployment, we have formed MONAI Consortium, an open-source community which is building standards for AI deployment in healthcare institutions, and developing tools and infrastructure to facilitate their implementation. This report represents several years of weekly discussions and hands-on problem solving experience by groups of industry experts and clinicians in the MONAI Consortium. We identify barriers between AI-model development in research labs and subsequent clinical deployment and propose solutions. Our report provides guidance on processes which take an imaging AI model from development to clinical implementation in a healthcare institution. We discuss various AI integration points in a clinical Radiology workflow. We also present a taxonomy of Radiology AI use-cases. Through this report, we intend to educate the stakeholders in healthcare and AI (AI researchers, radiologists, imaging informaticists, and regulators) about cross-disciplinary challenges and possible solutions.
Understanding our brain is one of the most daunting tasks, one we cannot expect to complete without the use of technology. MindBigData aims to provide a comprehensive and updated dataset of brain signals related to a diverse set of human activities, so that it can inspire the use of machine learning algorithms and serve as a benchmark of 'decoding' performance from raw brain activity into the corresponding labels of mental or physical tasks. The data are captured with commercial off-the-shelf EEG devices or with custom ones built by us to explore the limits of the technology. We describe the data collection procedures for each of the sub-datasets and for every headset used to capture them. We also report possible applications in the field of Brain-Computer Interfaces (BCI) that could impact the lives of billions in almost every sector, such as game-changing use cases in healthcare, industry, or entertainment, to name a few. In the end, why not use our brains directly to 'disintermediate' the senses, as the final HCI (Human-Computer Interaction) device? This is simply what we call the journey from Type to Touch to Talk to Think.
Modern mobile burst photography pipelines capture and merge a short sequence of frames to recover an enhanced image, but often disregard the 3D nature of the scene they capture, treating pixel motion between images as a 2D aggregation problem. We show that in a "long-burst", forty-two 12-megapixel RAW frames captured in a two-second sequence, there is enough parallax information from natural hand tremor alone to recover high-quality scene depth. To this end, we devise a test-time optimization approach that fits a neural RGB-D representation to long-burst data and simultaneously estimates scene depth and camera motion. Our plane plus depth model is trained end-to-end, and performs coarse-to-fine refinement by controlling which multi-resolution volume features the network has access to at what time during training. We validate the method experimentally, and demonstrate geometrically accurate depth reconstructions with no additional hardware or separate data pre-processing and pose-estimation steps.